ABSTRACT

We apply belief propagation to a Bayesian bipartite graph composed of discrete independent hidden variables and discrete visible variables. The network is the Discrete counterpart of Independent Component Analysis (DICA), and it is manipulated in factor graph form for inference and learning. A full set of simulations is reported for character images from the MNIST dataset. The results show that the factorial code implemented by the sources contributes to building a good generative model for the data that can be used in various inference modes.

Index Terms — Bayesian Networks; Belief Propagation; 
ICA; 

1. INTRODUCTION 

Bi-directional information flow in belief propagation networks is becoming a very popular framework in many signal processing applications [1], [2] because inference and learning can be easily manipulated with a small set of rules. Generally, Bayesian models aim at capturing the hidden structure that may underlie observed data through the assumption of a network of random variables that are only partially, or occasionally, visible [3].

Independent Component Analysis (ICA) is a popular signal processing framework in which observed data are mapped to, or generated from, independent hidden source variables [4]. The variables are typically continuous and the transformation between sources and visible variables is linear. ICA has been used in many applications for signal separation and for analyzing signals and images [?]. ICA filters, trained on real images, seem to converge to patterns that resemble the receptive fields found in the visual cortex [?].

In this paper we explore the possibility of using the generative model of ICA on discrete variables. The Bayesian model is constrained to a finite number of discrete hidden sources (factorial code) that feed the visible variables, which are also discrete. Computational difficulties naturally emerge in dealing with the product space of discrete alphabets. Nevertheless, we find that, even when limited to tractable small sizes, the DICA framework shows clear potential in applications, perhaps as a building block of more complex architectures. Discrete Component Analysis (DCA) has also been discussed by Buntine et al. [7] with reference to different models.

We reduce the DICA architecture to a Bayesian factor graph in the so-called reduced normal form (see [?] and references therein) that includes only simple interconnected blocks. We experiment with belief propagation on this architecture using images extracted from the MNIST dataset [12]. We show that, after learning, the DICA network converges to a generative model that accurately reproduces the image set.

In Section 2 the Bayesian model is presented, and in Section 3 its discrete version is transformed into a factor graph for belief propagation. The various modes of inference are discussed in Section 4 and learning in Section 5. The simulations for unsupervised mapping of the MNIST images are reported in Section 6, with the addition of the label variable in Section 7. The conclusions are in Section 8.

2. THE BAYESIAN MODEL 



Fig. 1. The Bayesian Graph for M independent sources 




Fig. 2. The Bayesian Graph for M independent sources after 
the sources have been grouped (married). 

In this paper we focus on the generative model depicted as the bipartite graph of Figure 1, with $M$ independent source variables $S_1, S_2, \ldots, S_M$ (hidden). The main variables $X_1, X_2, \ldots, X_N$ (visible) are connected to the source variables via the factorization

$$p(X_1 X_2 \ldots X_N S_1 S_2 \ldots S_M) = p(X_1|S_1 S_2 \ldots S_M)\, p(X_2|S_1 S_2 \ldots S_M) \cdots p(X_N|S_1 S_2 \ldots S_M)\, p(S_1)\, p(S_2) \cdots p(S_M). \qquad (1)$$

Note that $X_1, X_2, \ldots, X_N$, to be conditionally independent, must be conditioned on the whole set of sources, even though the marginal distribution of the sources factorizes: $p(S_1 S_2 \ldots S_M) = p(S_1)\, p(S_2) \cdots p(S_M)$. This appears to be the most general model for independent hidden sources that underlie a set of dependent variables $X_1, X_2, \ldots, X_N$. When $M = 1$, the system degenerates into a single-variable latent model [?].
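
The factorization (1) can be checked numerically on a toy model. The following is a minimal sketch, assuming two binary sources and three binary visible variables with randomly drawn probability tables; the sizes, the random tables and the names `priors`, `conds` and `joint` are illustrative assumptions and are not taken from the experiments.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, N = 2, 3  # two binary sources, three binary visible variables (toy sizes)

# Source priors p(S_i): one probability vector per source (random placeholders).
priors = [rng.dirichlet(np.ones(2)) for _ in range(M)]
# Conditionals p(X_j | S_1 ... S_M): arrays indexed as [s_1, ..., s_M, x_j].
conds = [rng.dirichlet(np.ones(2), size=(2,) * M) for _ in range(N)]

def joint(x, s):
    """Evaluate equation (1) for one configuration of visible x and hidden s."""
    p = np.prod([priors[i][s[i]] for i in range(M)])
    return p * np.prod([conds[j][s + (x[j],)] for j in range(N)])

configs_x = list(itertools.product(range(2), repeat=N))
configs_s = list(itertools.product(range(2), repeat=M))

# The joint must sum to one over all configurations.
total = sum(joint(x, s) for x in configs_x for s in configs_s)

# Summing out the visible variables must return the product of the priors.
p_s = {s: sum(joint(x, s) for x in configs_x) for s in configs_s}
```

Summing the joint over the visible variables returns the product of the source priors, which is the marginal independence that the Bayesian graph alone does not display.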

One way of solving for the probability functions involved in the Bayesian model is to group (marry) the source variables (parents) [?], as in Figure 2. Note that the Bayesian graph does not show that the source variables are marginally independent. This is made more explicit in the factor graph representation that will follow.

2.1. Generative model for classical ICA 

Independent Component Analysis is obtained when all the variables $x_1, x_2, \ldots, x_N, s_1, s_2, \ldots, s_M$ are real-valued and the conditional probability density functions $p(x_i|s_1 s_2 \ldots s_M)$ are constrained to depend on linear combinations of $s_1, s_2, \ldots, s_M$. More specifically, the typical assumption is that the linear combinations contribute to the means of $x_1, \ldots, x_N$ and that the dispersion around the mean is spherical and follows a Gaussian distribution

$$p(x_i|s_1 s_2 \ldots s_M) = \mathcal{N}(x_i;\, a_i^T s,\, \sigma^2), \quad i = 1, \ldots, N, \qquad (2)$$

where the vector $s = [s_1 s_2 \ldots s_M]^T$ contains all the source values and $a_i$ is the $i$th column of the $M \times N$ coefficient matrix $A = [a_1 a_2 \ldots a_N]$ [?]. More compactly, $p(x|s) = \mathcal{N}(x;\, A^T s,\, \sigma^2 I_N)$, where $x = [x_1 x_2 \ldots x_N]^T$. The sources' pdfs $p(s_1), p(s_2), \ldots, p(s_M)$ can follow various distributions that go from uniform to Laplacian [5]. Typically, for the model to be identifiable, the sources cannot be Gaussian (except perhaps for one out of $M$).
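
As an illustration of the generative reading of (2), the following minimal sketch draws samples from a linear model with independent Laplacian sources and spherical Gaussian noise. The sizes, the noise level `sigma`, the random matrix `A` and the unit-scale Laplacian are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, T = 3, 5, 50000      # sources, visible variables, samples (toy sizes)
sigma = 0.1                # assumed noise standard deviation

A = rng.normal(size=(M, N))              # coefficient matrix with columns a_1..a_N
s = rng.laplace(size=(T, M))             # independent unit-scale Laplacian sources
noise = sigma * rng.normal(size=(T, N))  # spherical Gaussian dispersion
x = s @ A + noise                        # row n holds (A^T s[n])^T plus noise

# A unit-scale Laplacian has variance 2, so the model covariance of x is
# 2 A^T A + sigma^2 I; the sample covariance should be close to it.
cov_model = 2.0 * A.T @ A + sigma**2 * np.eye(N)
cov_sample = np.cov(x, rowvar=False)
```

The comparison between `cov_model` and `cov_sample` is only a sanity check on second-order statistics; it says nothing about whether samples of this kind resemble structured images.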

Unfortunately, when ICA is used as a generative model it is hard to produce realistic images, even when experimental densities are used as source densities [?]. Structured patches are easy to obtain, but they do not resemble the complex structures found in natural images. The reason is that independent continuous sources do not carry the structure needed to assemble the ICA components into the complex structures found in natural images. We report a simulation in the following that seems to confirm these results. Attempts have been made to use ICA in two-layer architectures [?]. However, it is not clear how to properly include nonlinearities (without nonlinearities the whole system would still be linear), and investigations in this direction are still in progress.



Fig. 3. The DICA model as a factor graph in reduced normal form. The shaded boxes represent the fixed matrices $P(S_1 S_2 \ldots S_M|S_i)$, $i = 1, \ldots, M$. The unshaded boxes represent the conditional probability matrices $P(X_j|S_1 S_2 \ldots S_M)$, $j = 1, \ldots, N$.


2.2. Discrete ICA 

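In this work we experiment on the unconstrained ICA model with discrete variables. More specifically, we assume that both sources and visible variables take values in the finite discrete alphabets $\mathcal{S}_1, \mathcal{S}_2, \ldots, \mathcal{S}_M$ and $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_N$, with sizes $|\mathcal{S}_1|, |\mathcal{S}_2|, \ldots, |\mathcal{S}_M|$ and $|\mathcal{X}_1|, |\mathcal{X}_2|, \ldots, |\mathcal{X}_N|$.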

The difficulties in dealing with such a model are clearly related to the computational complexity of manipulating the product space $\mathcal{S} = \mathcal{S}_1 \times \mathcal{S}_2 \times \cdots \times \mathcal{S}_M$, which has size $|\mathcal{S}| = |\mathcal{S}_1| |\mathcal{S}_2| \cdots |\mathcal{S}_M|$ (Figure 2). However, we find that even limiting our attention to small dimensionalities, i.e. to few source variables and to small alphabets, the framework applied to natural images reveals quite interesting results. Furthermore, the basic architecture can be used as a building block for more complicated multi-layer Bayesian architectures (not discussed in this paper).

3. DICA IN REDUCED NORMAL FORM 

Probability propagation and learning for the graph of Figure 2 can be handled in a very flexible way if we transform the model into a factor graph as in Figure 3. The graph is in the so-called reduced normal form (see [?] and references therein), which is composed only of one-to-one blocks, source blocks and diverters (these are equality constraint blocks that act like buses for belief propagation). One-to-one blocks are characterized by a conditional probability matrix and sources by a probability vector. We have often advocated the use of such a representation because it can be handled as a block diagram and it is amenable to distributed implementations. We have also designed a Simulink library for rapid prototyping [?].

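More specifically for the DICA model, the source variables, which have prior distributions $\pi_{S_1}, \ldots, \pi_{S_M}$, are mapped to the product space via the fixed row-stochastic matrices (shaded blocks)

$$P((S_1 S_2 \ldots S_M)^{(1)}|S_1) = \frac{|\mathcal{S}_1|}{|\mathcal{S}|}\, I_{|\mathcal{S}_1|} \otimes 1^T_{|\mathcal{S}_2|} \otimes 1^T_{|\mathcal{S}_3|} \otimes \cdots \otimes 1^T_{|\mathcal{S}_M|},$$
$$\vdots$$
$$P((S_1 S_2 \ldots S_M)^{(M)}|S_M) = \frac{|\mathcal{S}_M|}{|\mathcal{S}|}\, 1^T_{|\mathcal{S}_1|} \otimes 1^T_{|\mathcal{S}_2|} \otimes 1^T_{|\mathcal{S}_3|} \otimes \cdots \otimes I_{|\mathcal{S}_M|}, \qquad (3)$$
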

where $\otimes$ denotes the Kronecker product, $1_K$ is a $K$-dimensional column vector with all ones, and $I_K$ is the $K \times K$ identity matrix. The conditional probability matrix is such that each variable contributes to the product space with its value and is uniform on the components that pertain to the other source variables. The blocks at the bottom of Figure 3 represent the $|\mathcal{S}| \times |\mathcal{X}_j|$ conditional probability matrices $P(X_j|S_1 S_2 \ldots S_M)$, $j = 1, \ldots, N$, which together with the source prior distributions are typically learned from data. Information flows in the network bi-directionally: for each branch variable there is a forward ($f$) and a backward ($b$) message, which are (or are proportional to) discrete probability vectors. Messages are usually kept normalized for numerical stability. The variables connected to the diverter represent replicated versions of the same variable, but they all carry different forward and backward messages that are combined with the product rule [?]. Propagation through each one-to-one block follows the sum rule, which in the direction of the variable is the matrix multiplication $f_{out} = P(out|in)^T f_{in}$ (already normalized) and in the opposite direction is $\tilde{b}_{in} = P(out|in)\, b_{out}$, followed by $b_{in} = \tilde{b}_{in} / (1^T \tilde{b}_{in})$ (normalization). After propagation for a number of steps equal to the graph diameter (if there are no loops), the posterior probability $p$ for a branch variable can be computed with the normalized product $p = (f \odot b) / (f^T b)$ ($\odot$ denotes the element-by-element product of two vectors). For the reader not familiar with this framework, it should be emphasized that these simple rules are a rigorous translation of marginalization and Bayes' theorem [?].
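
The fixed matrices and the two propagation rules can be sketched in a few lines of NumPy. This is a minimal illustration, assuming three sources with toy alphabet sizes and hand-picked priors; the helper names `source_to_product` and `normalize` are hypothetical and do not refer to the Simulink library mentioned above.

```python
import numpy as np
from functools import reduce

def source_to_product(sizes, i):
    """Fixed row-stochastic matrix P(S_1...S_M | S_i), shape (|S_i|, |S|)."""
    factors = [np.eye(k) if m == i else np.ones((1, k))
               for m, k in enumerate(sizes)]
    P = reduce(np.kron, factors)
    return P * sizes[i] / np.prod(sizes)   # scaling makes each row sum to one

def normalize(v):
    return v / v.sum()

sizes = [2, 3, 2]                          # toy alphabet sizes (assumed)
priors = [np.array([0.7, 0.3]),
          np.array([0.2, 0.5, 0.3]),
          np.array([0.6, 0.4])]
blocks = [source_to_product(sizes, i) for i in range(len(sizes))]

# Sum rule: forward message of each source through its one-to-one block.
forward = [P.T @ pi for P, pi in zip(blocks, priors)]

# Product rule at the diverter: combine the messages of all source branches.
f_product_space = normalize(reduce(np.multiply, forward))

# The combined message should equal the Kronecker product of the priors.
expected = reduce(np.kron, priors)
```

In this sketch the product of the forward messages at the diverter coincides with the Kronecker product of the priors, which is how the factor graph encodes the marginal independence of the sources.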

4. INFERENCE IN THE DICA GRAPH 

The flexibility of this framework allows the use of the factor graph of Figure 3 in various inference modes. Information flow is bi-directional and, assuming that all the parameters have been learned and that the unspecified messages are initialized to uniform distributions, we can use the DICA graph in the following modes:

(1) Generation: Source values are picked and injected as forward delta distributions at $S_1, S_2, \ldots, S_M$. After three steps of message propagation, the forward distributions are collected at the terminal variables $X_1, X_2, \ldots, X_N$. They are the (soft) decoded version of the source values. Note that these are distributions that are typically displayed as their means or their argmaxes (see the simulation results in the following).

(2) Encoding: Observed values for $X_1, X_2, \ldots, X_N$ are injected as backward delta distributions at the bottom. After three steps of message propagation, the backward distributions are multiplied with the forward distributions at $S_1, S_2, \ldots, S_M$. The normalized result is a (soft) factorial code of the input. The set of argmaxes of these distributions is the MAP decoding of the input.

(3) Pattern completion: Only a subset of values for $X_1, X_2, \ldots, X_N$ is available (there are erasures). The available values are injected at the bottom as backward delta distributions. For the missing values, uniform distributions are usually injected. After three steps of message propagation, forward distributions are collected at the bottom variables. For the observed variables the forward-backward products return just the deltas on the observations and provide no new information. At the unknown variables, the forward distribution is our best (soft) knowledge of that variable. Here too the means or the argmaxes can be used as a final result. The inference on the erasures is the synthesis of the information coming from the observations and the priors.

(4) Error correction: Available values for $X_1, X_2, \ldots, X_N$ may contain errors. They are presented as backward delta distributions at the bottom variables. After three steps of message propagation, forward distributions (or their means or argmaxes) are collected and used as corrections. No product with the backward messages is applied here because we do not know which components are reliable. In a similar scheme, the values for $X_1, X_2, \ldots, X_N$ may be known softly via distributions that are injected at the bottom as backward messages.








Note that in both (3) and (4) coded versions of the observations are also available at the source branches.
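
The encoding mode can be sketched as follows on a toy graph with three binary sources and four binary visible variables. The probability tables are random stand-ins for learned parameters, and the names `source_block`, `encode`, `P_src` and `P_x` are illustrative assumptions. The last lines compare the result of message passing with a brute-force marginalization of the joint.

```python
import itertools
from functools import reduce
import numpy as np

rng = np.random.default_rng(1)
sizes, N = [2, 2, 2], 4                  # three binary sources, four binary pixels
S = int(np.prod(sizes))                  # size of the product space

def source_block(i):
    """Fixed row-stochastic matrix P(S_1...S_M | S_i), shape (|S_i|, |S|)."""
    factors = [np.eye(k) if m == i else np.ones((1, k)) for m, k in enumerate(sizes)]
    return reduce(np.kron, factors) * sizes[i] / S

priors = [rng.dirichlet(np.ones(k)) for k in sizes]          # stand-ins for learned priors
P_src = [source_block(i) for i in range(len(sizes))]
P_x = [rng.dirichlet(np.ones(2), size=S) for _ in range(N)]  # P(X_j | S), shape (|S|, 2)

x_obs = [1, 0, 0, 1]                                         # observed visible values
b_x = [np.eye(2)[v] for v in x_obs]                          # backward deltas at the bottom

# Step 1: backward through each bottom block (sum rule, opposite direction).
b_up = [P @ b for P, b in zip(P_x, b_x)]
# Forward messages from the sources into the product-space diverter.
f_up = [P.T @ pi for P, pi in zip(P_src, priors)]

def encode(i):
    """Posterior at source i: product rule at the diverter, then Bayes at S_i."""
    others = [f for k, f in enumerate(f_up) if k != i]
    into_block = reduce(np.multiply, b_up + others)          # step 2: diverter
    b_s = P_src[i] @ into_block                              # step 3: back to S_i
    post = priors[i] * b_s
    return post / post.sum()

code = [encode(i) for i in range(len(sizes))]                # soft factorial code
map_code = [int(np.argmax(p)) for p in code]                 # argmax decoding

# Brute-force check: enumerate the joint and marginalize.
joint = np.array([np.prod([priors[i][s[i]] for i in range(len(sizes))])
                  for s in itertools.product(*[range(k) for k in sizes])])
joint = joint * np.prod([P_x[j][:, x_obs[j]] for j in range(N)], axis=0)
joint = (joint / joint.sum()).reshape(sizes)
exact = [joint.sum(axis=tuple(a for a in range(len(sizes)) if a != i))
         for i in range(len(sizes))]
```

Since this toy graph has no loops beyond the single diverter, the three propagation steps should reproduce the exact posterior marginals at the sources.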

5. LEARNING IN THE DICA GRAPH 

To train the DICA system, we assume that a set of $T$ examples is available for the visible variables, $(x_1[n], x_2[n], \ldots, x_N[n])$, $n = 1, \ldots, T$ (training set). Learning the system matrices for the bottom blocks and the vectors for the sources is performed using an EM search. Various algorithms can be used, all inspired by a localized maximum likelihood cost function. The iterations are confined to each block and use only locally available forward and backward messages. Details on the learning algorithms for the factor graph in reduced normal form have been reported elsewhere and are omitted here for space reasons (see [?], [?] and references therein).
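
Since the localized block algorithms are not detailed here, the following sketch uses a plain global EM for the same latent model instead. It is not the localized procedure of the cited works and is given only to make the roles of the learned quantities concrete. The random binary data, the sizes and the names `log_joint`, `priors` and `P_x` are assumptions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
M, N, T = 2, 6, 200                       # binary sources, binary pixels, examples
S = 2 ** M
states = np.array(list(itertools.product(range(2), repeat=M)))   # (S, M)

X = rng.integers(0, 2, size=(T, N))       # placeholder binary training data
priors = rng.dirichlet(np.ones(2), size=M)            # (M, 2) source priors
P_x = rng.dirichlet(np.ones(2), size=(N, S))          # (N, S, 2): P(X_j | S)

def log_joint(priors, P_x):
    """log p(x[n], s) for every example n and product state s, shape (T, S)."""
    log_ps = np.log(priors[np.arange(M), states]).sum(axis=1)        # (S,)
    log_px = sum(np.log(P_x[j][:, X[:, j]]).T for j in range(N))     # (T, S)
    return log_px + log_ps

history = []                              # training log-likelihood per iteration
for _ in range(20):
    lj = log_joint(priors, P_x)
    log_lik = np.logaddexp.reduce(lj, axis=1)          # log p(x[n])
    history.append(log_lik.sum())
    q = np.exp(lj - log_lik[:, None])                  # E-step: p(s | x[n])

    # M-step for the bottom blocks: expected counts of (s, x_j) pairs.
    for j in range(N):
        onehot = np.eye(2)[X[:, j]]                    # (T, 2)
        counts = q.T @ onehot                          # (S, 2)
        P_x[j] = counts / counts.sum(axis=1, keepdims=True)
    # M-step for the source priors: expected counts of each source value.
    for i in range(M):
        counts = np.array([q[:, states[:, i] == v].sum() for v in range(2)])
        priors[i] = counts / counts.sum()
```

The posterior over the product space is the only quantity shared between the updates, which suggests why a block-local formulation based on forward and backward messages is possible.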

Fig. 4. Distribution means generated by the factorial code for an increasing number of sources ($M = 1, 2, 3, 4, 8$). The bars show the learned source priors.


6. DICA SIMULATIONS 


We report here a full set of simulations on the MNIST data set [12]. We have reduced the images to $28 \times 28$ binary pixels and extracted 500 images as our training set. In a first set of experiments we train the architecture of Figure 3 with all binary variables: $|\mathcal{X}_j| = 2$, $j = 1, \ldots, N$ ($N = 784$); $|\mathcal{S}_i| = 2$, $i = 1, \ldots, M$, for various numbers of sources $M = 1, 2, 3, 4, 8$. During learning the 500 images of the training set are presented as backward delta distributions on $X_1, \ldots, X_N$, one time, with 5 cycles inside each block (the maximum likelihood algorithm inside each block is iterative [?]). Therefore, for each order $M$ we obtain the conditional probability matrices $P(X_j|S_1 \ldots S_M)$, $j = 1, \ldots, N$, and the prior distributions $\pi_{S_1}, \ldots, \pi_{S_M}$.

Generation: Figure 4 shows, for increasing $M$, the means of $f_{X_1}, \ldots, f_{X_N}$ when at the sources we inject the $2^M$ binary configurations in the forward messages $f_{S_1}, \ldots, f_{S_M}$. The learned priors are also reported in the figure. We note that a larger number of sources, and hence a larger product space (sizes 2, 4, 8, 16, 256), corresponds to increasingly accurate pattern memorization. For some characters that are different in shape, the system builds separate representations. The source variables, independent by definition (factorial code), learn marginal distributions that are progressively less uniform as the number of sources increases (recall that the vector that represents $p(S_1, \ldots, S_M)$ is the Kronecker product of the individual binary distributions and that even small non-uniformities in the priors cause $p(S_1, \ldots, S_M)$ to be highly non-uniform).
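
The remark on non-uniformity can be made concrete with a small computation. The sketch below assumes that all eight binary sources share the same hypothetical prior (0.6, 0.4); this value is chosen for illustration and is not one of the learned priors.

```python
from functools import reduce
import numpy as np

M = 8
prior = np.array([0.6, 0.4])              # hypothetical, mildly non-uniform prior
p_joint = reduce(np.kron, [prior] * M)    # p(S_1, ..., S_M) on the product space

# Ratio between the most and the least probable configuration: (0.6 / 0.4) ** 8.
ratio = p_joint.max() / p_joint.min()
```

With these assumed values the most probable of the 256 configurations is about 25.6 times more likely than the least probable one, although each individual prior departs only slightly from uniform.
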
Encoding: Figure 5 shows the typical results of presenting to the DICA graph of Figure 3, with $M = 8$, images from the test set (i.e. not included in the 500 images used for training) as backward delta distributions at $X_1, \ldots, X_N$. In the third column the posterior distributions at the sources are shown (only the probability of one of the two symbols is depicted). Here the DICA graph acts as an encoder: the (soft) binary configurations are the factorial code of the presented images. Note that not all the codes are sharp. In the second column the mean of the forward distributions at $X_1, \ldots, X_N$ is also shown.
Decoding: In Figure 6 the same DICA graph is used as a soft decoder when smooth and sharp distributions are injected at the sources.

Pattern completion: Figure 7 shows the results of the same network when, as backward messages at $X_1, \ldots, X_N$, we present images (from the test set) with 50% of the pixels removed. For the erased pixels a uniform backward distribution is presented. The third and the fourth columns report the means of the forward and the posterior distributions, respectively. The network fills in the missing parts rather well.
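
The pattern completion mode can be sketched on a toy model as follows. The tables are random stand-ins for learned parameters, `None` marks an erased pixel, and the combined forward message from the sources is taken directly as the Kronecker product of the priors. All names and sizes are illustrative assumptions.

```python
from functools import reduce
import numpy as np

rng = np.random.default_rng(3)
M, N = 3, 6                                # binary sources, binary pixels (toy sizes)
S = 2 ** M

priors = [rng.dirichlet(np.ones(2)) for _ in range(M)]        # stand-ins for learned priors
p_s = reduce(np.kron, priors)                                 # forward message on the product space
P_x = [rng.dirichlet(np.ones(2), size=S) for _ in range(N)]   # P(X_j | S), shape (|S|, 2)

x = [1, 0, None, 1, None, 0]                                  # None marks an erased pixel
b_x = [np.full(2, 0.5) if v is None else np.eye(2)[v] for v in x]
b_up = [P @ b for P, b in zip(P_x, b_x)]                      # backward messages into the diverter

def forward_at(j):
    """Forward distribution at X_j from the priors and all the other pixels."""
    others = [b for k, b in enumerate(b_up) if k != j]
    msg = reduce(np.multiply, others, p_s)                    # product rule at the diverter
    return P_x[j].T @ (msg / msg.sum())                       # sum rule through block j

# Keep the observed pixels and fill each erasure with the argmax of its forward.
completed = [v if v is not None else int(np.argmax(forward_at(j)))
             for j, v in enumerate(x)]
```

A uniform backward message at an erased pixel contributes a constant vector at the diverter, so in this sketch the erasures do not influence each other and the forward distribution at an erased pixel depends only on the priors and on the observed pixels.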

6.1. Continuous ICA on the same dataset 

The natural question at this point is whether similar results could be obtained with continuous ICA. The model is clearly very different, but we have attempted a comparison on the same data set. On the 500 MNIST images of the training set we have computed ICA using the FastICA algorithm available for Matlab [?]. We have retained only the first 8 components (largest variance) and estimated the output densities using average histograms. Random samples from these densities are used to generate the images through the inverse ICA [?]. Figure 8 shows the 8 masks and some generated images. The results confirm that, even if ICA nicely represents bases for the data, with unconstrained independent samples at the sources only average structures are generated. We have also tried a larger number of components, and the images obtained look very similar. These results seem to be consistent with other experiments presented in the literature [14] for patches of natural images, where only average textures are obtained. Linear ICA with independent unconstrained sources does not seem to be a generative model that preserves the structured composition of the training set.

Fig. 5. Encoding of some images from the test set. Col. 1: images presented as backward delta distributions. Col. 2: means of the forward distributions. Col. 3: posterior probabilities at the sources (the bars represent, for each of the eight sources $S_1, \ldots, S_8$, the posterior probability of one of the two symbols).


7. DICA FOR CLASSIFICATION 

The great flexibility of the factor graph framework makes it easy to extend the architecture of the DICA graph to the one shown in Figure 9, where a label variable $C$ is also included. The variable $C$ belongs to the finite alphabet $\mathcal{C} = \{c^1, \ldots, c^{|\mathcal{C}|}\}$ and is attached directly, through a conditional probability matrix $P(C|S_1, \ldots, S_M)$, to the product space diverter. Diverters in the reduced normal form act like probability pipelines [?].

Simulations have been performed on the same MNIST training set of 500 binarized images, in the same mode as in the unsupervised experiments, with the addition, during training, of the label information as a backward delta distribution. All the blocks, now including the probability matrix $P(C|S_1, \ldots, S_M)$, are trained for $M = 8$. On the learned network, a typical recognition task on two images from the test set is shown in Figure 10. The bar graph represents simultaneously classification and encoding. Note how in the first row the network is naturally confused between two labels.

A generative experiment is also performed on this architecture with backward delta distributions injected at $C$. The results are shown in the accompanying figure. The images are the mean forward distributions at $X_1, \ldots, X_N$ and could be considered as the prototypes for the ten labels. The bar graphs are the corresponding simultaneous encoding at the sources.

Fig. 6. Decoding for smooth forward distributions at the sources (in the brackets, the forward probabilities $f_{S_1}, \ldots, f_{S_8}$ of one of the two symbols). Panel labels: [1 0 0 0 0 0 1 1], [0.75 0 0 0 0 0 1 1], [0.5 0 0 0 0 0 1 1], [0.25 0 0 0 0 0 1 1], [0 0 0 0 0 0 1 1].

Fig. 7. Pattern completion of images from the test set after 50% removal.
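
Both uses of the label variable can be sketched on a toy model. The tables below are random stand-ins for learned parameters, and the sizes and the names `P_c`, `f_c` and `prototype` are illustrative assumptions.

```python
from functools import reduce
import numpy as np

rng = np.random.default_rng(4)
M, N, K = 3, 5, 4                          # sources, pixels, classes (toy sizes)
S = 2 ** M

p_s = reduce(np.kron, [rng.dirichlet(np.ones(2)) for _ in range(M)])  # product-space prior
P_x = [rng.dirichlet(np.ones(2), size=S) for _ in range(N)]           # P(X_j | S)
P_c = rng.dirichlet(np.ones(K), size=S)                               # P(C | S), shape (|S|, K)

def normalize(v):
    return v / v.sum()

# Recognition: backward deltas at the pixels, forward distribution read at C.
x_obs = [0, 1, 1, 0, 1]
b_up = [P[:, v] for P, v in zip(P_x, x_obs)]       # P @ delta selects one column
f_c = P_c.T @ normalize(reduce(np.multiply, b_up, p_s))
label = int(np.argmax(f_c))

# Generation from a label: backward delta at C, forward means read at the pixels.
c = 2
post_s = normalize(p_s * P_c[:, c])                # product-space posterior given C = c
prototype = np.array([(P.T @ post_s)[1] for P in P_x])   # P(X_j = 1 | C = c)
```

In both directions the label block is treated exactly like one more bottom block attached to the product-space diverter, so no special machinery is needed for classification in this sketch.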






















































Fig. 8. Continuous ICA comparison: (a) 8 ICA masks for 
the Training Set (b) 8 generated images using at the sources 
random values drawn from estimated output histograms. 

Fig. 9. The DICA model for classification.

Fig. 10. Recognition task on two images from the test set.

8. CONCLUSIONS 

The simulations on the MNIST dataset with binary sources show that belief propagation in the DICA architecture, also with the addition of the label variable, provides a unified framework in which image data can be coded, generated and corrected in a very flexible way. We have also experimented with quantized patches of natural images, obtaining very similar results, also when the sources have alphabet sizes greater than two. These results will be reported elsewhere. We are currently pursuing the use of this framework for building multi-layer architectures.